The dataset contains the following information on car properties:
data("mtcars") #loading dataset
head(mtcars) #brief view to content
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The dataset contains 32 unique elements, each representing a car model. The dataset is small and will not provide robust information.
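As a quick sanity check, the size of the dataset and the number of distinct values per column can be inspected with base R alone. This is only a minimal sketch; the names `n_models` and `n_distinct` are introduced here for illustration and do not appear elsewhere in the analysis.

```r
data("mtcars")

# Number of rows, plus a check that every row name (car model) is unique
n_models <- nrow(mtcars)
stopifnot(anyDuplicated(rownames(mtcars)) == 0)

# Number of distinct values taken by each variable; columns with only a
# handful of distinct values are candidates for ordinal or nominal treatment
n_distinct <- sapply(mtcars, function(x) length(unique(x)))

print(n_models)
print(n_distinct)
```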
#distribution plots
ggplot(melt.data.frame(mtcars), aes (value)) +
geom_histogram(bins=15,aes(y=..density..)) + geom_density() +
facet_wrap(~variable,scales = "free")
## Using as id variables
Fig.1 Distribution plots of dataset variables (exploratory analysis)
The distributions of the variables are shown in Fig. 1. Based on Fig. 1, the variables cyl, gear and carb are ordinal, while vs and am are nominal. The nominal variables in particular need special consideration during clustering and pattern identification. To identify possible linear correlations between pairs of variables, a correlation matrix was constructed; it is presented below in Tab. 1.
nc <- ncol(mtcars) #number of dataframe columns
hdr <- colnames(mtcars) # header (dataframe column names)
#correlation matrix
C <- round(cor(mtcars),2) #create correlation matrix
lowerTriangle(C) <- NA #for better readability purge lower triangular part
print(C)
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl NA 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp NA NA 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp NA NA NA 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat NA NA NA NA 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt NA NA NA NA NA 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec NA NA NA NA NA NA 1.00 0.74 -0.23 -0.21 -0.66
## vs NA NA NA NA NA NA NA 1.00 0.17 0.21 -0.57
## am NA NA NA NA NA NA NA NA 1.00 0.79 0.06
## gear NA NA NA NA NA NA NA NA NA 1.00 0.27
## carb NA NA NA NA NA NA NA NA NA NA 1.00
Fig. 2 presents scatter plots of all pairs of variables. Further graphical exploration of the data was carried out in order to choose suitable variables for this exercise.
pairs(mtcars)
Fig.2. Scatter plots of dataset variables (exploratory analysis)
Choosing the pairs of variables whose absolute correlation exceeds a chosen level:
interesting_correlation_level <- 0.6
vlist <- list()
plist <- list()
k<-0
for (i in 1:nc)
{
  if (i==nc) #yes! I am R and I don't care about being fast! I never have..., use while and stop complaining
    vec <- NULL
  else
    vec <- (i+1):nc
  for (j in vec){
    if (abs(C[i,j]) > interesting_correlation_level)
    {
      #print(c(hdr[i],hdr[j]))
      k <- k+1
      vlist[[k]] <- c(hdr[i],hdr[j])
      local ({
        i<-i
        j<-j
        p <- ggplot(mtcars) + geom_point(aes( x= mtcars[hdr[i]], y= mtcars[hdr[j]] )) +xlab(hdr[i]) +ylab(hdr[j])
        plist[[k]] <<- p}) #symbolic manipulation in R is somewhat strange!
    }
  }
}
length(vlist)
## [1] 26
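The same selection can be sketched without explicit loops. The snippet below is an alternative, not the code that produced the plots; `threshold`, `idx` and `pairs_df` are illustrative names, and the correlation matrix is rounded to two decimals as in Tab. 1 so that borderline pairs are treated the same way.

```r
data("mtcars")

threshold <- 0.6
C <- round(cor(mtcars), 2)

# Keep only the strict upper triangle so each pair is considered once
C[!upper.tri(C)] <- NA

# Row/column indices of entries whose absolute correlation exceeds the threshold
# (which() silently skips the NA entries)
idx <- which(abs(C) > threshold, arr.ind = TRUE)

# Sort row by row to mirror the order in which the nested loops visit the pairs
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

pairs_df <- data.frame(
  x = colnames(C)[idx[, "row"]],
  y = colnames(C)[idx[, "col"]],
  r = C[idx],
  stringsAsFactors = FALSE
)

print(nrow(pairs_df))
print(head(pairs_df))
```

With the rounded matrix shown in Tab. 1, this should select the same 26 pairs as the loop, starting with mpg and cyl.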
Plotting them
multiplot(plotlist = plist[1:9], cols = 3) #yeah, manually! For Gods sake!
## Loading required package: grid
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
multiplot(plotlist = plist[10:18], cols = 3)
multiplot(plotlist = plist[19:26], cols = 3)
linvlist <- c(1,2,4,5,7,8,10,12,13,14,16,17,19,20,22,25) #list of potentially interesting linear correlations (chosen manually)
multiplot(plotlist = plist[linvlist], cols = 4) #multiplot definition in different source file
The selection was done manually so that each pair has an intuitive direction of dependence. For example, it is more reasonable that cyl influences mpg than the other way round.
lmodel <- lm(data=mtcars,mpg~cyl) #it should rather be Ist type regression
ggplot(data=mtcars,aes(x=cyl,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,mpg~disp)
ggplot(data=mtcars,aes(x=disp,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,mpg~drat)
ggplot(data=mtcars,aes(x=drat,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,mpg~wt)
ggplot(data=mtcars,aes(x=wt,y=mpg)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,disp~cyl)
ggplot(data=mtcars,aes(x=cyl,y=disp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,hp~cyl)
ggplot(data=mtcars,aes(x=cyl,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,wt~cyl)
ggplot(data=mtcars,aes(x=cyl,y=wt)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,hp~disp)
ggplot(data=mtcars,aes(x=disp,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,drat~disp)
ggplot(data=mtcars,aes(x=disp,y=drat)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,wt~disp)
ggplot(data=mtcars,aes(x=disp,y=wt)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,hp~wt)
ggplot(data=mtcars,aes(x=wt,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,qsec~hp)
ggplot(data=mtcars,aes(x=hp,y=qsec)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,hp~carb)
ggplot(data=mtcars,aes(x=carb,y=hp)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,drat~wt)
ggplot(data=mtcars,aes(x=wt,y=drat)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,drat~gear)
ggplot(data=mtcars,aes(x=gear,y=drat)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
lmodel <- lm(data=mtcars,qsec~carb)
ggplot(data=mtcars,aes(x=carb,y=qsec)) +geom_point() + geom_abline(slope = lmodel$coefficients[2], intercept = lmodel$coefficients[1])
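The sixteen fits above repeat one pattern, so they could be condensed into a loop over formulas. The sketch below is an assumption about how that might look, not the code that produced the figures: `formulas` and `fits` are illustrative names, only a few of the models are listed, and the intercept, slope and R² are collected in a table instead of drawing each plot.

```r
data("mtcars")

# A subset of the response ~ predictor pairs fitted above
formulas <- c("mpg ~ cyl", "mpg ~ disp", "mpg ~ wt", "hp ~ disp", "qsec ~ hp")

# Fit each simple linear model and collect its coefficients and R^2
fits <- do.call(rbind, lapply(formulas, function(f) {
  m <- lm(as.formula(f), data = mtcars)
  data.frame(
    model     = f,
    intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared,
    stringsAsFactors = FALSE
  )
}))

print(fits)
```

The `intercept` and `slope` columns hold the same values that are passed to `geom_abline()` in the calls above.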
To look for possible linear dependencies between variables, PCA was performed on the continuous variables (mpg, disp, hp, drat, wt, qsec). The ordinal and nominal variables were left out.
gPCAreduced <- PCA(mtcars[c("mpg","disp","hp","drat","wt","qsec")], scale.unit = TRUE, ncp = 6, graph = TRUE)
No surprise: the PCA only confirms the relationships already seen in the linear regressions.
Trying to linearly reduce the dataset dimensionality:
dPCAreduced <- PCA(mtcars[c("mpg","disp","hp","drat","wt","qsec")], scale.unit = TRUE, ncp = 6, graph = FALSE)
dPCAreduced$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 4.18739648 69.7899413 69.78994
## comp 2 1.14811212 19.1352020 88.92514
## comp 3 0.33335666 5.5559444 94.48109
## comp 4 0.15436054 2.5726757 97.05376
## comp 5 0.12479601 2.0799335 99.13370
## comp 6 0.05197818 0.8663031 100.00000
The first 3 components explain 94.5% of the variance and the first 4 explain 97.1%, so the dataset can be reduced to 3 or 4 features.
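The choice of 3 or 4 components can be made explicit with a cumulative-variance threshold. The sketch below uses base R's `prcomp()` rather than the `PCA()` call above; for standardised variables both should give the eigenvalues of the correlation matrix. The 90% and 95% thresholds and the names `vars`, `cum_var` and `n_components` are assumptions chosen for illustration.

```r
data("mtcars")

vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# PCA on centred, unit-variance variables (i.e. on the correlation matrix)
pc <- prcomp(mtcars[vars], center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pc$sdev^2
cum_var <- cumsum(eig) / sum(eig)

# Smallest number of components reaching a given share of the variance
n_components <- function(threshold) which(cum_var >= threshold)[1]

print(round(100 * cum_var, 2))
print(n_components(0.90))
print(n_components(0.95))
```

With the eigenvalues listed above, a 90% threshold gives 3 components and a 95% threshold gives 4.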
#kPCA with some assumed params
kpc <- kpca(~.,data=mtcars,kernel="rbfdot",
kpar=list(sigma=0.2),features=ncol(mtcars))
#print the principal component vectors
PC <-pcv(kpc)
mtc = mtcars
mtc$pc1 <- PC[,1]
mtc$pc2 <- PC[,2]
mtc$pc3 <- PC[,3]
plot_ly(mtc, x = ~pc1, y = ~pc2, z = ~pc3) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")))
kPCA seems quite promising.
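To make the steps behind kernel PCA visible, the computation can be sketched in base R. This is not a reimplementation of `kpca()`: it assumes the RBF kernel has the form exp(-sigma * ||x - y||^2), reuses sigma = 0.2 on the unscaled data as in the call above, and scales the projections by the square root of the eigenvalues, which may differ from the scaling that `pcv()` returns. The names `K`, `Kc` and `scores` are illustrative.

```r
data("mtcars")

X <- as.matrix(mtcars)
sigma <- 0.2
n <- nrow(X)

# RBF kernel matrix from squared Euclidean distances between cars
D2 <- as.matrix(dist(X))^2
K <- exp(-sigma * D2)

# Centre the kernel matrix in feature space
H <- diag(n) - matrix(1 / n, n, n)
Kc <- H %*% K %*% H

# Eigendecomposition of the centred kernel matrix
e <- eigen(Kc, symmetric = TRUE)

# Projections of the 32 cars onto the first three kernel principal components
scores <- e$vectors[, 1:3] %*% diag(sqrt(e$values[1:3]))
colnames(scores) <- c("pc1", "pc2", "pc3")

print(round(e$values[1:3], 4))
print(head(round(scores, 4)))
```

Printing a few entries of `K` is a cheap way to see how strongly the chosen `sigma` separates the cars before interpreting the 3D plot.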